Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. Benchmark results are presented for three current x86-based microprocessors, showing clearly that our optimization works best on designs with high-speed shared caches and low memory bandwidth per core. We furthermore demonstrate that simple bandwidth-based performance models are inaccurate for this kind of algorithm and employ a more elaborate, synthetic modeling procedure. Finally, we show that temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment, albeit with limited benefit at strong scaling.